A Statistical Approach to Predicting IMDB Scores

  1. Introduction
    1.1 Background
    1.2 Data Description
    1.3 Problem Statement
  2. Data Exploration
    2.1 Data Loading
    2.2 Data Profile
    2.3 Data Cleaning
  3. Regression Model Building
    3.1 Splitting the Dataset
    3.2 Scaling to avoid Euclidean Distance problem
    3.3 Feature Elimination
    3.4 Simple Linear Regression
    3.5 Support Vector Machines with Linear, Polynomial and RBF Kernels
    3.6 Ensemble Models
    3.6.1 Gradient Boosting with Hyperparameter Tuning
    3.6.2 Random Forest with Hyperparameter Tuning
    3.7 XGBoost with Hyperparameter Tuning
    3.8 Interpreting Results of a Regression Model
  4. Building a Classification Model
    4.1 Logistic Regression
    4.2 Support Vector Machines with Linear, Polynomial and RBF Kernels
    4.3 Ensemble Models
    4.3.1 Random Forest with Hyperparameter Tuning
    4.3.2 Gradient Boosting with Hyperparameter Tuning
    4.4 XGBoost with Hyperparameter Tuning
    4.5 Interpreting Results of Classification Model
  5. Conclusion

1 Introduction

1.1 Background

A commercially successful movie not only entertains the audience, but also brings film companies tremendous profit. Many factors, such as good directors and experienced actors, contribute to making a good movie. However, famous directors and actors can usually deliver the expected box-office income, but they cannot guarantee a highly rated IMDB score.

1.2 Data Description

The dataset is from the Kaggle website. It contains 28 variables for 5043 movies, spanning 100 years and 66 countries. There are 2399 unique director names and thousands of actors/actresses. "imdb_score" is the response variable, while the other 27 variables are possible predictors.

Variable Name              Description
movie_title                Title of the movie
duration                   Duration in minutes
director_name              Name of the movie's director
director_facebook_likes    Number of likes on the director's Facebook page
actor_1_name               Primary actor starring in the movie
actor_1_facebook_likes     Number of likes on actor 1's Facebook page
actor_2_name               Another actor starring in the movie
actor_2_facebook_likes     Number of likes on actor 2's Facebook page
actor_3_name               Another actor starring in the movie
actor_3_facebook_likes     Number of likes on actor 3's Facebook page
num_user_for_reviews       Number of users who gave a review
num_critic_for_reviews     Number of critic reviews on IMDB
num_voted_users            Number of people who voted for the movie
cast_total_facebook_likes  Total number of Facebook likes for the entire cast
movie_facebook_likes       Number of likes on the movie's Facebook page
plot_keywords              Keywords describing the movie plot
facenumber_in_poster       Number of faces in the movie poster
color                      Film colorization: 'Black and White' or 'Color'
genres                     Film categories such as 'Animation', 'Comedy', 'Romance', 'Horror', 'Sci-Fi', 'Action', 'Family'
title_year                 Year the movie was released (1916-2016)
language                   English, Arabic, Chinese, French, German, Danish, Italian, Japanese, etc.
country                    Country where the movie was produced
content_rating             Content rating of the movie
aspect_ratio               Aspect ratio the movie was made in
movie_imdb_link            IMDB link for the movie
gross                      Gross earnings of the movie in dollars
budget                     Budget of the movie in dollars
imdb_score                 IMDB score of the movie

1.3 Problem Statement

Given this large amount of movie information, it is interesting to understand which factors make one movie more successful than another. In other words, we would like to analyze what kinds of movies earn higher IMDB scores.

In this notebook we build two kinds of models, regression and classification. For each kind we start from a basic model, move to more advanced models, and explain why we choose the advanced one.

Under regression, we fit a regression line to our data to predict the continuous target variable imdb_score.

Under classification, we fit a classification model to our data and classify imdb_score into three categories:

imdb_score  Class
1-3         Flop Movie
3-6         Average Movie
6-10        Hit Movie

2. Data Exploration

2.1 Data Loading
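The loading cells are not reproduced here, but the step amounts to a pandas read_csv call. Below is a minimal sketch; the filename "movie_metadata.csv" is an assumption (the usual name of this Kaggle file), and a tiny inline sample stands in for it so the snippet is self-contained:

```python
import io
import pandas as pd

# Tiny inline sample standing in for the real Kaggle file.
csv_data = io.StringIO(
    "movie_title,duration,imdb_score\n"
    "Movie A,120,7.1\n"
    "Movie B,95,5.4\n"
)
# In the notebook this would be: df = pd.read_csv("movie_metadata.csv")
df = pd.read_csv(csv_data)
print(df.shape)  # (2, 3)
```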

2.2 Data Profile

2.3 Data Cleaning

Data cleaning is one of the most important parts of building a model. Here we perform the standard preprocessing steps so that our model is not fed garbage.

2.3.1 Missing Value Treatment
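The treatment itself happens in the hidden cells; a minimal sketch of the usual approach, on a hypothetical miniature frame (the choice of dropping target-missing rows and median-imputing the rest is illustrative), would be:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the cleaning step.
df = pd.DataFrame({
    "imdb_score": [7.1, np.nan, 5.4, 8.0],
    "budget": [1e6, 2e6, np.nan, 4e6],
})
df = df.dropna(subset=["imdb_score"])                    # target must be present
df["budget"] = df["budget"].fillna(df["budget"].median())  # impute the rest
print(len(df))  # 3 rows survive
```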

2.3.2 Profile Report after missing value treatment

After dealing with the null data we have lost about 25% of the given data. Next, let's convert the data into numerical form to feed our model.

2.3.3 Converting Categoricals to Numericals to feed our model

Let us deal with the categorical columns first by converting them into numerical values.

As we can see, there are only 2 categories in the color variable, so we can simply map 'Color' to 1 and 'Black and White' to 0.
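A sketch of that mapping, assuming the two labels read 'Color' and 'Black and White' (possibly with stray whitespace, which is worth checking on the real data):

```python
import pandas as pd

df = pd.DataFrame({"color": ["Color", " Black and White", "Color"]})
# Strip whitespace first, then map the two categories to 1 / 0.
df["color"] = df["color"].str.strip().map({"Color": 1, "Black and White": 0})
print(df["color"].tolist())  # [1, 0, 1]
```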

The column genres has a huge number of unique values. Let us split this feature into two: main_genre and the remaining genres.

Let's convert both columns, main_genre and genres, into numerical values.
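A sketch of the split, assuming the Kaggle convention that genres is pipe-separated; the column name num_genres for the count is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"genres": ["Action|Adventure|Sci-Fi", "Comedy", "Drama|Romance"]})
df["main_genre"] = df["genres"].str.split("|").str[0]     # first listed genre
df["num_genres"] = df["genres"].str.split("|").str.len()  # how many genres
print(df[["main_genre", "num_genres"]].values.tolist())
```

The main_genre column is still categorical at this point; it gets encoded like the other categoricals below.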

The variable actor_1_name also has high cardinality, so we replace each name with its count of occurrences.
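This kind of frequency encoding can be sketched as follows, on toy names:

```python
import pandas as pd

df = pd.DataFrame({"actor_1_name": ["A", "B", "A", "C", "A"]})
# Replace each name by how often it appears in the column.
counts = df["actor_1_name"].value_counts()
df["actor_1_name"] = df["actor_1_name"].map(counts)
print(df["actor_1_name"].tolist())  # [3, 1, 3, 1, 3]
```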

As we can see, out of 3816 records we have 3749 unique records, which is not helpful for making predictions, so we drop the column from our dataframe.

This variable also has high cardinality, so we change it into its value counts as well.

Looking into this variable, we can see it has high cardinality and is unstable, so we could delete it; what we mainly need is to extract the main plot keyword from it.

As we can see, the extracted main plot keyword also has high cardinality, but it is stable, so we can replace it with its value counts.

The variable movie_imdb_link is unique for every record, so it will not help in predicting the target variable and we drop it.

The language variable has only 38 unique values and is consistent, so we just do label encoding.

The country variable has only 47 unique values and is consistent, so we just do label encoding.

Content rating has only 12 unique values and can also be label encoded.
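For language, country, and content_rating, label encoding can be sketched with scikit-learn's LabelEncoder (toy values shown):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"language": ["English", "French", "English", "German"]})
# LabelEncoder assigns integer codes in alphabetical order of the labels.
le = LabelEncoder()
df["language"] = le.fit_transform(df["language"])
print(df["language"].tolist())  # [0, 1, 0, 2]
```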

2.3.4 Profile Report after data cleaning

Looking into the profile report, we now have warnings about skewness and zeros. These will be wiped out by the scaling operation after splitting the dataset, and the unwanted variables will be removed during feature elimination.

3. Regression Model Building

3.1 Splitting the Dataset

3.2 Scaling to avoid Euclidean Distance Problem

We scale after we split the dataset because we do not want statistics from the test set to leak into the training set.
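A minimal sketch of split-then-scale with StandardScaler on synthetic numbers:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training set only, then apply it to both splits,
# so no test-set statistics leak into training.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.mean(axis=0))  # ~0 for the training split by construction
```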

Now we build our model. Since we have many features, of which only some will be useful, let's do some feature selection for our regression model.

3.3 Feature Elimination

We don't want to feed our model variables that might not help in prediction, so we remove variables with high collinearity and keep only useful variables by doing Recursive Feature Elimination.
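A sketch of Recursive Feature Elimination with scikit-learn on synthetic data (the real notebook runs it on the movie features; the estimator and the number of features kept here are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Keep the 5 strongest of 10 synthetic features according to a linear model.
X, y = make_regression(n_samples=100, n_features=10, n_informative=5, random_state=0)
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5).fit(X, y)
print(rfe.support_)  # boolean mask of the retained features
```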

3.4 Simple Linear Regression

Looking into the stats, we observe that the R² score is low, about 0.37, even with all the consistent variables kept, and the regression line does not fit the data well. So we move to more flexible models such as support vector machines and ensemble algorithms to fit the data better.
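The baseline fit and its R² can be sketched like this on synthetic data (the 0.37 score above comes from the real movie features, not from this toy example):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))  # held-out R², at most 1.0
print(r2)
```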

3.5 Support Vector Machines with Linear, Polynomial, RBF Kernels

3.6 Ensemble Models

3.6.1 Gradient Boosting with Hyperparameter Tuning

3.6.2 Random Forest with Hyperparameter Tuning

Let's tweak the hyperparameters of the RandomForestRegressor to find the best parameters for the model.
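A sketch of the grid search; the parameter grid here is a small illustrative one, not necessarily the notebook's actual search space:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=100, n_features=5, random_state=0)
# Hypothetical grid -- tune the values on the real data.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)  # best combination found by cross-validation
```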

3.7 XGBoost with Hyperparameter Tuning

3.8 Interpreting Results of Regression Model

We consider XGBoost the final model, as it has the lowest error rate.

Looking across all the metrics, we see that XGBRegressor with the parameters {'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 500} gave the best results, with a mean squared error of 0.404. The feature importance given by this model is shown above.

4. Building a Classification Model

To build the classification model, I reuse the preprocessed data from the regression model; however, I replace the target variable with a new one for classification.

imdb_score  Class
1-3         Flop Movie
3-6         Average Movie
6-10        Hit Movie

We have created the target variable, and now we reuse the independent variables from the regression model.
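The binning above can be done with pandas' cut (bin edges taken from the table; the class labels are the three categories):

```python
import pandas as pd

scores = pd.Series([2.5, 5.0, 7.8, 9.1])
# Right-inclusive bins: (1, 3], (3, 6], (6, 10].
labels = pd.cut(scores, bins=[1, 3, 6, 10], labels=["Flop", "Average", "Hit"])
print(labels.tolist())  # ['Flop', 'Average', 'Hit', 'Hit']
```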

4.1 Logistic Regression

Logistic regression is a linear algorithm that basically does binary classification. To use logistic regression for multiclass classification we set the solver parameter to 'saga'. Other solver choices can also handle multiclass classification; I used 'saga' because it also supports L2 regularization.
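A minimal multiclass sketch with solver='saga' on synthetic 3-class data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=150, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)
# 'saga' supports the multinomial loss as well as L1/L2 penalties.
clf = LogisticRegression(solver="saga", max_iter=5000).fit(X, y)
print(clf.predict(X[:3]))
```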

4.2 Support Vector Classifier with Linear, Polynomial, RBF

The Support Vector Classifier also does binary classification by default. To achieve multiclass classification, we set decision_function_shape to 'ovo', which uses libsvm's original one-vs-one decision function with shape (n_samples, n_classes * (n_classes - 1) / 2).
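A sketch showing the 'ovo' decision-function shape on a synthetic 3-class problem:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)
clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
# One-vs-one: n_classes * (n_classes - 1) / 2 = 3 pairwise scores per sample.
print(clf.decision_function(X[:2]).shape)  # (2, 3)
```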

4.3 Ensemble Models

4.3.1 Random Forest Classifier with Hyperparameter Tuning

4.3.2 Gradient Boosting Classifier with Hyperparameter Tuning

4.4 XGBoost Classifier with Hyperparameter Tuning

As we can see, Gradient Boosting with hyperparameter tuning gives us the best results, even though ensemble models in general tend to overfit. We therefore take the Gradient Boosting Classifier as the final model for our classification.

4.5 Interpreting Results of Classification Model

We consider the Gradient Boosting Classifier the final model, with 83% accuracy.

5. Conclusion

Looking into the feature importance of the best models, the XGBoost Regressor and the Gradient Boosting Classifier, we see that both models assign almost the same importance to the respective features. The results of all regression and classification models are as follows:

Regression Model          Mean Squared Error
Simple Linear Regression  0.70
SVR (Linear)              0.72
SVR (Polynomial)          0.93
SVR (RBF)                 0.68
Gradient Boosting         0.43
Random Forest             0.45
XGBoost                   0.40
Classification Model  Misclassifications  Accuracy  Precision  Recall  F1-Score
Logistic Regression   190                 0.75      0.47       0.40    0.41
SVC (Linear)          181                 0.76      0.47       0.45    0.46
SVC (Polynomial)      143                 0.81      0.52       0.50    0.51
SVC (RBF)             146                 0.81      0.51       0.50    0.50
Random Forest         130                 0.83      0.54       0.50    0.51
Gradient Boosting     127                 0.83      0.54       0.51    0.52
XGBoost               139                 0.82      0.52       0.51    0.51